Data Engineer Resume
SUMMARY
- Data Engineer / Machine Learning Engineer with 7+ years of professional experience in Big Data, Machine Learning, and Data Analytics platforms.
- Expertise in applying advanced statistical methods and machine learning algorithms using SQL, R, and Python.
- A collaborative engineering professional with substantial experience designing and executing solutions for complex business problems involving large-scale data warehousing, real-time analytics, and reporting solutions.
- Built large-scale big data applications and data pipelines that deliver insights from heterogeneous data sets.
- Hands-on experience in Hadoop, Spark, Scala, Data Science, and Big Data architecture and development.
- Experience in Statistical Modeling, Predictive Modeling, Data Analytics, Data Modeling, Data Analysis, Data Mining, Text Mining, and Natural Language Processing (NLP) algorithms.
- Experience using analytical applications such as R and Python to identify trends and relationships in data, draw appropriate conclusions, and translate analytical findings into risk-management and marketing strategies that drive value.
- Extensive experience in Text Analytics, developing statistical machine learning solutions to various business problems, and generating data visualizations using R and Python.
- Hands-on experience with R packages and libraries such as caret, ggplot2, dplyr, algorithmics, e1071, ROSE, epiR, and ggvis.
- Used pandas, NumPy, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning algorithms (a brief sketch follows this summary).
- Experience applying Artificial Intelligence to Text Analytics; developed statistical machine learning and data mining solutions to various business problems, generated data visualizations using R and Python, and created dashboards using tools like Tableau.
- Extensive experience with ETL and reporting tools such as SQL Server Integration Services (SSIS) and SQL Server Reporting Services (SSRS).
- Extensive experience performing ETL on structured and semi-structured data using Pig Latin scripts.
- Development and support experience with Oracle, SQL, PL/SQL, and T-SQL queries.
- Extensive experience using Excel pivot tables to run and analyze result data sets, as well as UNIX scripting.
- Involved in all stages of the data product development life cycle: from coordinating requirements with the business, through development with SMEs and data scientists, to production deployments with developers.
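As a small illustration of the Python stack listed above, the following is a minimal, hypothetical sketch of a pandas/scikit-learn workflow; the column names and synthetic data are placeholders, not project data:

```python
# Minimal sketch of a pandas + scikit-learn workflow; the frame and
# column names below are hypothetical stand-ins for real project data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fabricate a small illustrative dataset; in practice this would be
# loaded from a warehouse table or flat file.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=500),
    "monthly_spend": rng.normal(70.0, 15.0, size=500),
})
df["churned"] = (df["monthly_spend"] / df["tenure_months"] > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["tenure_months", "monthly_spend"]], df["churned"],
    test_size=0.2, random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```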
TECHNICAL SKILLS
Programming Languages: Python; working knowledge of R and Java
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, twitteR, NLP, reshape2, rjson, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, rpy2.
Reporting Tools: Tableau, SAS BI, Microsoft Power BI
Databases: SQL, Hive, Spark SQL, MySQL
Big Data Technologies: Spark and Hadoop
ETL Tools: SSIS
Operating Systems: Windows, Linux/Unix
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential
Responsibilities:
- Performed data profiling to learn about user behavior and merged data from multiple data sources.
- Implemented big data processing applications to collect, clean, and normalize large volumes of open data using Hadoop ecosystem tools such as Pig, Hive, and HBase.
- Designed and developed various machine learning frameworks using Python, R, and MATLAB.
- Integrated R into MicroStrategy to expose metrics determined by more sophisticated and detailed models than are natively available in the tool.
- Implemented the flagship product with a strong OOP design for scalability.
- Independently coded new programs and designed tables to load and test them effectively for the given POCs using Big Data/Hadoop.
- Architected big data solutions for projects and proposals using Hadoop, Spark, the ELK stack, Kafka, and TensorFlow.
- Corrected minor data errors that prevented EDI files from loading.
- Developed documents and dashboards of predictions in MicroStrategy and presented them to the Business Intelligence team.
- Used the Cloud Vision API to integrate vision detection features within applications, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.
- Implemented text mining to transform words and phrases in unstructured data into numerical values.
- Developed various QlikView data models by extracting and using data from various source files, DB2, Excel, flat files, and big data sources.
- Good knowledge of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
- As architect, delivered various complex OLAP databases/cubes, scorecards, dashboards, and reports.
- Tracked and enabled communication across multiple departments to keep all parties as informed as possible about potential issues.
- Utilized OpenCV for human face recognition and tackled the challenge of long running times for face detection on a personal computer.
- Programmed a utility in Python using multiple packages (SciPy, NumPy, pandas).
- Implemented classification using supervised algorithms such as Logistic Regression, Decision Trees, KNN, and Naive Bayes (a brief sketch follows this list).
- Gained knowledge of OpenCV and applied it to identify red-colored objects with the drone's camera.
- Used Teradata 15 utilities such as FastExport and MultiLoad (MLOAD) for data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Collaborated with data engineers to implement ETL processes, writing and optimizing SQL queries to extract data from the cloud and merge it from Oracle 12c.
- Experienced in delivery, portfolio, team/career, vendor, and program management; competent in solution architecture and in the implementation and delivery of Big Data, data science analytics, and DWH projects on Greenplum, Spark, Python, and TensorFlow.
- Coordinated the execution of A/B tests to measure the effectiveness of a personalized recommendation system.
- Performed data visualization with Tableau 10 and generated dashboards to present the findings.
- Recommended and evaluated marketing approaches based on analytics of customer consumption behavior.
- Determined customer satisfaction and helped enhance the customer experience using NLP.
Environment: MATLAB, MongoDB, exploratory analysis, feature engineering, K-Means clustering, hierarchical clustering, machine learning, Python, Spark (MLlib, PySpark), Tableau, MicroStrategy, Git, Unix, SAS, TensorFlow, regression, logistic regression, Hadoop 2.7, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML, MapReduce, OpenCV.
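A minimal sketch of the supervised-classification comparison referenced above; a bundled scikit-learn dataset stands in for the confidential project data:

```python
# Hedged sketch comparing the supervised classifiers named above
# (Logistic Regression, Decision Tree, KNN, Naive Bayes); a bundled
# scikit-learn dataset stands in for the confidential project data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "naive bayes": GaussianNB(),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy for a rough side-by-side comparison.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```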
Data Scientist / Big Data Engineer
Confidential
Responsibilities:
- Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data lifecycle management in both RDBMS and Big Data environments.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large business technology programs.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS; implemented a Python-based distributed random forest via Python streaming (a brief sketch follows this list).
- Performed source system analysis, database design, and data modeling for the warehouse layer using MLDM concepts and for the package layer using dimensional modeling.
- Created ecosystem models (conceptual, logical, physical, canonical) required to support services within the enterprise data architecture: a conceptual data model defining the major subject areas used, an ecosystem logical model defining the standard business meaning for entities and fields, and an ecosystem canonical model defining the standard messages and formats used in data integration services throughout the ecosystem.
- Developed Linux Shell scripts by using NZSQL/NZLOAD utilities to load data from flat files to Netezza database.
- Designed and implemented system architecture for Confidential EC2 based cloud-hosted solution for the client.
- Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
- Hands-on database design: relational integrity constraints, OLAP, OLTP, cubes, normalization (3NF), and denormalization of the database.
- Worked on customer segmentation using an unsupervised learning technique, clustering.
- Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
- Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models leveraging best-in-class modeling techniques.
Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, Scala, NLP, Cassandra, MapReduce, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, AWS.
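A minimal, hypothetical sketch of a Spark ML random-forest module like the one described above; the synthetic rows stand in for feature tables that, in the project, lived in Hive/HDFS on AWS:

```python
# Hedged sketch of a PySpark random-forest module; the feature names
# f1/f2 and the synthetic label are illustrative placeholders.
import random

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rf-sketch").getOrCreate()

# Build a small synthetic frame; in practice this would be read from Hive/HDFS.
random.seed(7)
rows = [(random.random(), random.random(), float(i % 2)) for i in range(200)]
df = spark.createDataFrame(rows, ["f1", "f2", "label"])

# Assemble feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=7)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = rf.fit(train)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print("test AUC:", auc)
```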
Data Engineer
Confidential, Seattle, WA
Responsibilities:
- Work hands-on to define architecture / blueprints and advise clients on technology strategy.
- Provided technical expertise and created software design proposals for upcoming components.
- Hands-on working experience with Azure Cloud and Data Lake Storage, a highly scalable and secure store for big data analytics; Data Lake Store holds vast amounts of raw data in its native format using a flat architecture.
- Hands-on work creating Spark clusters in both HDInsight and Azure Databricks environments.
- Implemented Spark jobs using PySpark and Spark SQL for faster testing and processing of data than MapReduce in Java; PySpark is used for data transformation (a brief sketch follows this section).
- Involved in designing and migrating existing systems using HDInsight, Spark, Hive, Pig, Sqoop, and HDFS.
- Hands-on experience in Azure with data-heavy analytics applications leveraging relational and NoSQL databases, data warehouses, and big data.
- Experience in performance tuning of complex ETL mappings for relational and non-relational workloads.
- Developed data ingestion/ETL pipelines to load data daily from identified sources using Azure Data Factory, a managed cloud service built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
- The ETL framework follows a pull mechanism: it pulls data from the designated sources and stores, transforms, and prepares it for reporting requirements.
- Used Azure SQL to store prepared data in tables from the ETL pipelines for reporting.
- Integrated and created job schedulers for the data aggregation and analytic models developed by data scientists, running the analytics on top of HDInsight and Cloud ML.
- Developed data visualizations using Power BI.
- Good understanding of the full software development life cycle.
- Redesigned and developed a critical ingestion pipeline to process vast amounts of data.
Environment: Python, PySpark, Azure Data Factory, Data Lake Store, Power BI, Azure SQL, Azure ML, SSIS, HDInsight, Azure Databricks
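A minimal sketch of the PySpark/Spark SQL transformation style described above; the event schema is hypothetical, and the commented-out Data Lake Storage paths (account name included) are placeholders:

```python
# Hedged sketch of a PySpark + Spark SQL transformation step; schema,
# table name, and storage paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

# In the pipeline, raw events would be read from Data Lake Storage, e.g.:
# raw = spark.read.json("abfss://raw@<account>.dfs.core.windows.net/events/")
raw = spark.createDataFrame(
    [("e1", "2024-01-01T10:00:00"),
     ("e1", "2024-01-01T10:00:00"),   # duplicate record to be dropped
     ("e2", "2024-01-02T09:30:00")],
    ["event_id", "event_ts"],
)

# Deduplicate and derive a partition-friendly date column.
cleaned = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("event_date").isNotNull())
)

# Register for Spark SQL and aggregate for the reporting layer.
cleaned.createOrReplaceTempView("events")
daily = spark.sql(
    "SELECT event_date, COUNT(*) AS event_count FROM events GROUP BY event_date"
)
daily.show()
# daily.write.mode("overwrite").parquet(
#     "abfss://curated@<account>.dfs.core.windows.net/daily_counts/")
```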