Data Engineer Resume
New York, NY
SUMMARY
- 8+ years of IT experience in Big Data, Data Science (Machine Learning, Deep Learning, NLP/Text Mining), Data/Business Analytics, Data Visualization, Data Operations, and BI.
- Proficient in a wide variety of data science languages and libraries: Python, R, SQL, PySpark, scikit-learn, NumPy, SciPy, Pandas, NLTK, TextBlob, Gensim, spaCy, Keras and TensorFlow.
- Excellent understanding of Hadoop architecture, its daemons and components such as HDFS, YARN, Resource Manager, Node Manager, NameNode and DataNode, and the MapReduce programming paradigm.
- Extensive experience with the Big Data ecosystem using the Hadoop framework and related technologies such as HDFS, MapReduce, Hive, Pig, HBase, Storm, YARN, Oozie, Sqoop, Airflow and ZooKeeper, along with working experience in Spark Core, Spark SQL, Spark Streaming, Scala and Kafka.
- Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Experienced in facilitating the entire lifecycle of a data science project: Data Cleaning, Data Extraction, Data Pre-Processing, Dimensionality Reduction, Algorithm implementation, Back Testing and Validation.
- Expert in Machine Learning algorithms such as Ensemble Methods (Random forests), Linear, Polynomial, Logistic Regression, Regularized Linear Regression, Support Vector Machines (SVM), Deep Neural Networks, Extreme Gradient Boosting, Decision Trees, K-Means, K-NN, Gaussian Mixture Models, Naive Bayes.
- Experienced in working with Datasets, Spark SQL, DataFrames and RDDs, handling large DataFrames using partitions, Spark in-memory capabilities, efficient joins, broadcast variables, User Defined Functions (UDFs), User Defined Aggregate Functions (UDAFs), actions, transformations and other operations during the ingestion process.
- Strong knowledge of Agile PLM, scrum processes and experience in redesign of product development process as per the business requirement.
- Expertise in PLM (Product Lifecycle Management) and MDM (Master Data Management).
- Experience in converting Hive/SQL queries into RDD transformations in a Spark environment using Scala and Python.
- Well versed in working with structured, unstructured and time-series data, and in statistical methodologies such as hypothesis testing, ANOVA, multivariate statistics, modeling, decision theory and time-series analysis.
- Proficient in data transformations using log, square-root, reciprocal, cube-root, square and full Box-Cox transformations, depending on the dataset.
- Experience with relational and non-relational databases such as MySQL, SQL, Oracle, MongoDB, Cassandra and PostgreSQL.
- Adroit at employing various Data Visualization tools like Tableau, Matplotlib, Seaborn, ggplot2, and Plotly.
- Hands-on experience with AWS services such as Redshift clusters and Route 53 domain configuration.
- Practical experience implementing AWS technologies including IAM, Elastic Compute Cloud (EC2), Simple Storage Service (S3), Virtual Private Cloud (VPC), Lambda, EBS, and EMR.
- Used Informatica to translate functional requirements into source-to-target data mapping documents for ETL, supporting large datasets (Big Data) loaded into AWS Cloud databases (Snowflake).
- Proficient with container systems like Docker and container orchestration tools such as EC2 Container Service and Kubernetes; worked with Terraform.
- Managed Docker orchestration and Docker containerization using Kubernetes.
- Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
- Technical Team Lead responsible for identifying and developing the Business Functions List and enterprise-wide Data Dictionary that support the DCPS database spiral development deliverables to the Social Security Administration (SSA) customer.
- Served as the Snowflake Database Administrator responsible for leading the data model design and the database migration, deployment and production releases, ensuring that database objects and corresponding metadata were successfully implemented across the platform environments (Dev, Qual and Prod) on AWS Cloud (Snowflake).
- Expertise in building and publishing customized interactive reports and dashboards with custom parameters and user filters using Tableau.
- Experience with complex Data processing pipelines, including ETL and Data ingestion dealing with unstructured and semi-structured Data.
- Good communication and presentation skills, willing to learn, adapt to new technologies and third-party products.
TECHNICAL SKILLS
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Kafka, Ambari, Hue
Hadoop/Spark Ecosystem: Hadoop, HDFS, MapReduce, Hive, HBase, Spark, Impala, Cloudera, Hortonworks HDP, Spark Core, Spark SQL, NiFi, Sqoop, Kafka, Spark Streaming
Schema: Snowflake, Teradata.
Programming Languages: Python, Scala, Java, PL/SQL, SQL, Linux Shell Scripts
Database: Oracle, MS SQL Server, MySQL, PostgreSQL
Cloud: AWS, Azure
AWS: S3, EMR, EC2, Glue, ELB
Tools: Jenkins, Maven, ANT
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, New York, NY
Responsibilities:
- Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used Spark to improve the performance of and optimize existing Hadoop algorithms, using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in near real time and persists it to Cassandra.
- Developed a Kafka consumer API in Scala for consuming data from Kafka topics.
- Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
- Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Worked on AWS Cloud services like EC2, S3, EBS, RDS and VPC.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storing small data sets, and was involved in maintaining the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Implemented Elasticsearch on the Hive data warehouse platform.
- Worked with Elastic MapReduce (EMR) and set up a Hadoop environment on AWS EC2 instances.
- Good understanding of Cassandra architecture, replication strategy, gossip, snitch etc.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra as per the business requirements.
- Wrote Python scripts to process semi-structured data in formats like JSON.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Created MapReduce jobs using Python scripts to perform ETL.
- Ran validation checks after each set of ETL queries and reported the results to the client at every stage of the project.
- Performed data extraction, aggregation and consolidation of Adobe data within AWS Glue using PySpark.
- Created external tables with partitions using Hive, AWS Athena and Redshift.
- Migrated existing MapReduce programs to the Spark model, initially using Python.
- Used the Spark DataStax Cassandra Connector to load data to and from Cassandra.
- Created clusters on IIS web servers using Network load balancing and managed net scale clusters including configuring clusters in global traffic management
- Tested the cluster Performance using Cassandra-stress tool to measure and improve the Read/Writes.
- Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables to perform data analysis that met the business specification logic.
- Used Kafka features such as distribution, partitioning and the replicated commit log service to maintain feeds for messaging systems.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering.
- Using Avro, Parquet, RCFile and JSON file formats, developed UDFs in Hive and Pig.
- Worked with Log4j framework for logging debug, info & error data.
- Performed transformations such as event joins, bot traffic filtering and some pre-aggregations using Pig.
- Developed custom loaders and storage classes in Pig to work with several data formats such as JSON, XML and CSV, and generated bags for processing in Pig.
- Used Amazon DynamoDB to gather and track the event-based metrics.
- Developed Sqoop and Kafka Jobs to load data from RDBMS, External Systems into HDFS and HIVE.
- Developed Oozie coordinators to schedule Pig and Hive scripts to create Data pipelines.
- Wrote several MapReduce jobs using the Java API and used Jenkins for continuous integration.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig and MapReduce cluster access for new users.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Solid Experience in developing Scope/Vision Documentation and Project Plan.
- Strong knowledge of SDLC, RUP methodology, and project life cycles.
- Generated various kinds of reports using Power BI and Tableau based on Client specification.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Worked on a NiFi data pipeline to process large data sets and configured lookups for data validation and integrity.
- Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, EC2, S3, Redshift, Glue, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Java, Python, Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, NiFi, Cassandra, Agile methodologies, SDLC and RUP for PLM.
Data Engineer
Confidential, New York, NY
Responsibilities:
- Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR and MapR (MapR data platform).
- Developed simple to complex MapReduce streaming jobs using Python, Hive and Pig.
- Used various compression mechanisms to optimize Map/Reduce Jobs to use HDFS efficiently.
- Used ETL component Sqoop to extract the data from MySQL and load data into HDFS.
- Performed ETL on business data and created a Spark pipeline to run the ETL process efficiently.
- Created MapReduce jobs that perform the entire ETL process.
- Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
- Loaded data into Hive tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
- Strong exposure to Unix scripting and hands-on experience with shell scripting.
- Wrote Python scripts to process semi-structured data in formats like JSON.
- Involved in loading and transforming large sets of structured, semi-structured and unstructured data.
- Troubleshot and identified bugs in Hadoop applications and worked with the testing team to resolve them.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
- Loaded data into Amazon Redshift and used AWS CloudWatch to collect metrics from and monitor AWS RDS instances within Confidential.
- Used the Python API to develop Kafka producers and consumers for writing Avro schemas.
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Installed the Ganglia monitoring tool to generate Hadoop cluster reports (CPUs running, hosts up and down, etc.) and performed operations to maintain the Hadoop cluster.
- Responsible for data analysis and cleaning using Spark SQL queries.
- Handled importing data from various data sources, performed transformations using Spark and loaded the data into Hive.
- Worked with the Spark Core, Spark Streaming and Spark SQL modules of Spark.
- Used Scala to write the code for all Spark use cases, gained extensive experience with Scala for data analytics on a Spark cluster, and performed map-side joins on RDDs.
- Explored various Spark modules and worked with DataFrames, RDDs and SparkContext.
- Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS and a Lambda function.
- Used CloudWatch Logs to move application logs to S3 and created alarms based on selected exceptions raised by applications.
- Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze data.
- Determined the viability of business problems for a Big Data solution with PySpark.
- Proactively monitored systems and services, and handled architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
- Monitored multiple Hadoop cluster environments using Ganglia, and monitored workload, job performance and capacity planning using MapR.
- Involved in time series data representation using HBase.
- Great working experience with Splunk for real time log data monitoring.
- Built clusters in the AWS environment using EMR with S3, EC2 and Redshift.
- Worked with Databricks to connect different sources and transform data for storage in the cloud platform.
- Experienced in building extensible data integration and data acquisition solutions to meet the requirement of the business.
- Experienced in building an optimized data integration platform that performs efficiently under growing data volumes.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team.
- Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka and Postgres running on other instances, using SSL handshakes in QA and production environments.
- Hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Worked with BI (Tableau) teams on dataset requirements; good working experience with data visualization.
Environment: MapReduce, AWS, S3, EC2, EMR, Redshift, Glue, Java, HDFS, Hive, Pig, Tez, Oozie, HBase, Spark, Scala, Spark SQL, Kafka, Python, PuTTY, PySpark, Cassandra, Shell Scripting, ETL, YARN, Splunk, Sqoop, Linux, Cloudera, Ganglia, SQL Server.
BigData Engineer
Confidential, CA
Responsibilities:
- Migrated MapReduce programs into Spark transformations using Spark and Scala, after an initial implementation in Python (PySpark).
- Developed Spark jobs using Scala on top of Yarn/MRv2 for interactive and Batch Analysis.
- Ran MapReduce jobs in Hadoop and implemented Spark analysis using Python for machine learning and predictive analytics on the AWS platform.
- Involved in querying data using Spark SQL on top of the Spark engine for faster data set processing.
- Worked on implementing the Spark Framework, a Java-based web framework.
- Developed Python code to gather data from HBase and designed the solution for implementation in PySpark.
- Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze data.
- Worked with Apache SOLR to implement indexing and wrote Custom SOLR query segments to optimize the search.
- Extracted data from HDFS using Hive and Presto, performed data analysis and feature selection using Spark with Scala, PySpark and Redshift, and created nonparametric models in Spark.
- Wrote Java code to format XML documents and uploaded them to the Solr server for indexing.
- Used Apache Solr for indexing and load-balanced querying to search for specific data in larger datasets, and implemented a near-real-time Solr index on HBase and HDFS.
- Worked on Ad hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into MongoDB for further analysis; also extracted files from MongoDB through Flume and processed them.
- Performed MongoDB NoSQL data modeling, tuning, disaster recovery and backup; used MongoDB for distributed storage and processing via CRUD operations.
- Involved in entire data ingestion process for dealing with both structured and unstructured data.
- Worked in Databricks to create an ETL pipeline for extracting customer data.
- Extracted and restructured the data into MongoDB using import and export command line utility tool.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS, using Flume with a spooling directory to load data from the local file system (LFS) to HDFS.
- Experienced in building an optimized data integration platform that performs efficiently under growing data volumes, using Snowflake schema and Teradata.
- Good working knowledge of Snowflake and Teradata databases; built analytical applications using Spark with Hive and SQL on Oracle and Snowflake.
- Experienced in using Snowflake cloning and Time Travel.
- Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
- Migrated an existing on-premises application to AWS.
- Used AWS services like EC2 and S3 for small data sets.
- Created Partitioned Hive tables and worked on them using HiveQL.
- Loaded data into HBase using bulk load and non-bulk load.
- Installed, Configured Talend ETL on single and multi-server environments.
- Monitored the Hadoop cluster using Cloudera Manager, worked with Cloudera support, logged issues in the Cloudera portal and fixed them as per the recommendations.
- Involved in Cloudera Hadoop Upgrades and Patches and Installation of Ecosystem Products through Cloudera manager along with Cloudera Manager Upgrade.
- Worked with the continuous integration tool Jenkins and automated end-of-day JAR builds.
- Worked with Tableau, integrated Hive with Tableau Desktop reports, and published them to Tableau Server.
- Developed a data pipeline using Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Developed Unix shell scripts to load large number of files into HDFS from Linux File System.
- Involved in setting up the whole app stack; set up and debugged Logstash to send Apache logs to AWS Elasticsearch.
- Created users, roles and groups to secure resources using local operating system authentication in Azure.
- Performed troubleshooting and diagnosis of hardware, software and network failures and provided resolutions using Azure.
- Involved in the entire ETL process while dealing with structured and unstructured data.
- Executed validation queries at each stage of the ETL process and updated the client with the relevant values at each phase.
- Served as the Snowflake Database Administrator responsible for leading the data model design and the database migration, deployment and production releases, ensuring that database objects and corresponding metadata were successfully implemented in the production platform environments on AWS Cloud (Snowflake).
- Deployed and managed applications in data center, virtual and Azure platform environments.
- Used Zookeeper to coordinate the servers in clusters and to maintain the data consistency.
- Worked in Agile development environment having KANBAN methodology. Actively involved in daily Scrum and other design related meetings.
- Used Oozie operational services for batch processing and dynamic workflow scheduling.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
Environment: Hadoop, HDFS, Hive, MapReduce, Azure, AWS, EC2, S3, Redshift, SOLR, Impala, MySQL, Oracle, Sqoop, Kafka, Spark, SQL, Talend, Python, PySpark, YARN, Pig, Oozie, Linux-Ubuntu, Scala, Tableau, Maven, Jenkins, Java, Cloudera, Snowflake, JUnit, Agile methodologies.
Data Engineer
Confidential
Responsibilities:
- In depth understanding of Hadoop Architecture and various components such as HDFS, Application master, Node Manager, Resource Manager, Name Node, Data node and MapReduce concepts.
- Imported required tables from RDBMS to HDFS using Sqoop and also used Storm and Kafka to get real time streaming of data into HBase.
- Used the NoSQL database HBase and created HBase tables to load large sets of semi-structured data coming from various sources.
- Wrote Hive and Pig scripts as ETL tools to perform transformations, event joins, bot traffic filtering and some pre-aggregations before storing the data in HDFS.
- Developed data pipeline using Flume, Sqoop, Pig and MapReduce to ingest customer behavioral data.
- Developed Spark code using Scala and Spark-SQL for faster testing and processing of data.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Developed Java code to generate, compare & merge AVRO schema files.
- Developed complex MapReduce streaming jobs in Java, implemented alongside Hive and Pig, and wrote Java MapReduce programs to perform various ETL, cleaning and scrubbing tasks.
- Prepared validation report queries, executed them after every ETL run, and shared the results with business users in different phases of the project.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting; applied Hive optimization techniques for joins and best practices when writing HiveQL scripts.
- Imported and exported data into HDFS and Hive using Sqoop, and wrote Hive queries to extract the processed data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
- Experienced in data ingestion from different data sources and in creating an optimal strategy for further analysis.
- Experienced in creating ETL pipelines for batch and streaming data.
- Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports as per users' needs.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Teamed up with Architects to design Spark model for the existing MapReduce model and Migrated MapReduce models to Spark Models using Scala.
- Implemented Spark using Scala, utilizing the Spark Core, Spark Streaming and Spark SQL APIs for faster data processing than MapReduce in Java.
- Used Spark SQL to load JSON data, create a SchemaRDD and load it into Hive tables, and handled structured data using Spark SQL.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted data from MySQL into HDFS using Sqoop.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Pig, Sqoop, Spark and Zookeeper.
- Expert knowledge on MongoDB NoSQL data modeling, tuning, disaster recovery and backup.
Environment: Apache Hadoop, AWS, Python, HDFS, MapReduce, HBase, Hive, Yarn, Pig, Sqoop, Flume, Zookeeper, Kafka, Impala, SparkSQL, Spark Core, Spark Streaming, NoSQL, MySQL, Cloudera, Java, JDBC, Spring, ETL, WebLogic, Web Analytics, Avro, Cassandra, Oracle, Shell Scripting, Ubuntu.
Python Developer
Confidential
Responsibilities:
- Assessed the infrastructure needs of each application and deployed it on the Azure platform.
- Built and deployed the code artifacts into the respective environments in the Confidential Azure cloud.
- Deployed and published a Django web app as a platform-as-a-service (PaaS) application in Azure App Service.
- Created Non-Prod and Prod Environments in Azure from scratch.
- Worked on various Azure services such as Compute (Web Roles, Worker Roles), Azure Websites, Caching, SQL Azure, NoSQL, U-SQL, Storage, Network services, Data Factory, Azure Active Directory, API Management, Scheduling and Auto Scaling.
- Developed U-SQL Scripts for schematizing the data in Azure Data Lake Analytics.
- Processed and transformed data by running U-SQL scripts on Azure.
- Designed the user interface and client-side scripting using AngularJS framework, Bootstrap and JavaScript.
- Created User Interface Design using HTML5, CSS3, JavaScript, jQuery, JSON, REST and AngularJS, Bootstrap.
- Developed GUI using JavaScript, HTML/HTML5, DOM, AJAX, CSS3, CQ5 and AngularJS in ongoing projects.
Environment: Azure Kubernetes services, Container Services, Model management, Terraform, Docker, Python, Django, HTML5, CSS3, JavaScript, jQuery, Ajax, Bootstrap, GitHub, VSTS.
